Overview

Column

Overview

The Problem


Climate change is one of the defining problems of our time, and India, the second most populous country in the world, contributes significantly to air pollution, with many of its cities among the 50 most polluted in the world. Yet the extent of pollution is not properly studied or used in decision making in India. Air quality monitoring stations exist in many cities, but in most cases their data is treated as a record of observations rather than used for prediction. Moreover, such AQI monitoring stations are absent in smaller cities. This project aims to build a forecasting model to predict AQI, focusing on Indian cities like Delhi, Mumbai and Chennai, which are infamous for their pollution levels.


India NOx pollution

Column

Pollution level by continents and income (PM 10)

Pollution levels in different countries (PM 10)

Data Summary

Column

Meteorological data

Meteorological data


The meteorological data was scraped from tutiempo.net; the code for the scraper is given with the file. The values collected were temperature, pressure, humidity and wind speed.

AQI Data

AQI Data


The AQI data was downloaded from Kaggle. The data set has AQI along with CO, NO, NOx, NH3, SO2, PM2.5 and PM10 concentrations measured in micrograms per cubic metre (µg/m³).

World pollution Data

World pollution Data


This data was downloaded from the WHO Global Ambient Air Quality Database (2018 update). It has location-wise pollution data for countries.

Column

India pollution Data

India pollution Data


The AQI data was manipulated and grouped to obtain mean pollution levels across various cities in India.
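As a sketch, the grouping step can be reproduced with base R's `aggregate` (the toy data and column names here are illustrative, assumed to match the Kaggle city_day.csv layout):

```r
# Hypothetical mini version of the city_day.csv data
aqi <- data.frame(
  City = c("Delhi", "Delhi", "Mumbai", "Mumbai"),
  NO2  = c(40, 60, 20, 30),
  CO   = c(1.2, 1.8, 0.9, 1.1)
)

# Mean pollutant level per city, as used for the India maps
fin <- aggregate(aqi[, c("NO2", "CO")], by = list(city = aqi$City), FUN = mean)
fin  # Delhi: NO2 50, CO 1.5; Mumbai: NO2 25, CO 1.0
```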

Latitude Longitude countries

Latitude Longitude countries


This data was downloaded from Kaggle. Since Google's ggmap geocoding is no longer free, this was a better option than buying API access.

Latitude Longitude Indian cities

Latitude Longitude Indian cities


This data has the locations (geocodes) of different Indian cities, which are matched with the existing pollution data for visualizations.

World Pollution Levels

Column

Plot Inference

Pollution Levels around the world

The plots alongside reveal how developing countries (low-income Mediterranean, South-East Asia) contribute most to particulate pollution. The other plots show how India compares to the rest of the world.

LMIC: Low- and Middle-Income Countries

HIC: High Income Countries

PM 2.5 across countries mapped

PM 10 across countries mapped

Column

PM 10 across continents

PM 2.5 across continents

India Pollution Levels

Column

Plot Inference

Pollution Levels in India

The plots alongside show the amounts of various pollutants and their spread across cities in India. Particulate matter is not shown here. The pollutants shown make up a significant portion of the Air Quality Index (AQI), a widely used measure that we also use in this project.


CO

NO2

Column

NOx

NO

NH3

Delhi

Column

Delhi

Temperature

Column

Pressure

Humidity

Wind

Mumbai

Column

Mumbai

Temperature

Column

Pressure

Humidity

Wind

Chennai

Column

Chennai

Temperature

Column

Pressure

Humidity

Wind

Methodology

Column

ACF/ PACF

Meteorological data


By looking at the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots we can tentatively identify the number of AR and/or MA terms needed. The ACF plot is a bar chart of the correlation coefficients between a time series and lags of itself; the PACF plot shows the partial correlation coefficients between the series and its lags.
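A minimal, self-contained illustration of these diagnostics (simulated AR(1) data, not the AQI series):

```r
set.seed(1)
y <- arima.sim(model = list(ar = 0.7), n = 300)  # simulated AR(1) series

acf(y)   # geometric decay across many lags suggests AR structure
pacf(y)  # a single large spike at lag 1 suggests one AR term
```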

Model Fitting

AQI Data


We have used the algorithm proposed by Hyndman & Khandakar (2008), implemented as auto.arima in R. This simplifies the process but increases time complexity; since we have a relatively small data set, we have gone ahead with it.
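In outline, the call is simply the following (a sketch on simulated data; the real series in this project are AQID.ts, AQIM.ts and AQIC.ts):

```r
library(forecast)

set.seed(1)
y <- arima.sim(model = list(ar = 0.7), n = 300)

# Stepwise Hyndman-Khandakar search over (p,d,q) orders, minimising AICc
fit <- auto.arima(y)
summary(fit)
```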

SARIMA

SARIMA


SARIMA is an integrated AR/MA (ARIMA) model fitted to daily/monthly values with an additional seasonal term. We fitted the model using auto.arima.
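For reference, the selected Delhi specification SARIMA(1,1,2)(0,1,0)[365] could also be refitted directly with forecast::Arima (a sketch; it assumes the daily AQID.ts series defined in the source):

```r
library(forecast)

# Non-seasonal order (p,d,q) = (1,1,2); seasonal order (P,D,Q) = (0,1,0),
# with the seasonal period taken from frequency(AQID.ts) = 365
fit <- Arima(AQID.ts, order = c(1, 1, 2), seasonal = c(0, 1, 0))
forecast(fit, h = 365)  # one-year-ahead forecast
```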

Column

SARIMAX

SARIMAX


The SARIMAX model additionally includes regressors, to model how the variable changes over time with covariates. This model was also fitted using auto.arima, but the results were unfavourable due to missing values and compounding of errors. A distributed lag model was also tried but gave worse results.
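The regressors enter through auto.arima's `xreg` argument. A self-contained sketch with simulated covariates (not the actual meteorological matrix used in this project):

```r
library(forecast)

set.seed(1)
n <- 200
xreg <- cbind(temp = rnorm(n, 25, 5), wind = rnorm(n, 10, 3))  # fake covariates
y <- 100 + 2 * xreg[, "temp"] + arima.sim(model = list(ar = 0.6), n = n)

# "Regression with ARIMA errors": ARIMA structure on the residual process
fit <- auto.arima(y, xreg = xreg)
summary(fit)
```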

Ljung-Box Test

Ljung-Box Test


The Ljung-Box test on the residuals shows whether the residuals are white noise or still contain predictability. The same test on the squared residuals checks for conditional heteroscedasticity.
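Both checks reduce to `Box.test` with `type = "Ljung-Box"`, applied once to the residuals and once to their squares (a sketch on a built-in series, not the AQI models):

```r
fit <- arima(lh, order = c(1, 0, 0))  # lh: built-in luteinizing-hormone series

# H0: residuals are white noise up to lag 10 (fitdf adjusts for the AR term)
Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)

# The same test on squared residuals probes conditional heteroscedasticity
Box.test(residuals(fit)^2, lag = 10, type = "Ljung-Box")
```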

Model Selection

Model Selection


There are various error measures that can be used for model selection: AIC, BIC, MSE, MAE, MAPE, etc. We have used MAPE, and we have chosen the most robust model, as robustness is also an important criterion in selection.
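MAPE is simple to compute by hand, which also makes its scale easy to interpret as a percentage error:

```r
# Mean Absolute Percentage Error; undefined when any actual value is 0
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

mape(c(100, 200), c(110, 180))  # |10/100| and |20/200| average to 10 (%)
```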

ACF/PACF


ACF Delhi

PACF Delhi

ACF Mumbai

PACF Mumbai

ACF Chennai

PACF Chennai

SARIMA

Column

Summary

Fitted Models

Delhi: SARIMA(1,1,2)(0,1,0)[365]
Mumbai: SARIMA(1,0,0)(0,1,0)[365] with drift
Chennai: ARIMA(4,1,3)

The Chennai series has many missing values and hence cannot be modelled properly, which explains the apparent lack of seasonality. We drop Chennai from our analysis.

Delhi

Column

Mumbai

Chennai

Model Results SARIMA


Delhi

[1] "Delhi"
Series: AQID.ts 
ARIMA(1,1,2)(0,1,0)[365] 

Coefficients:
         ar1      ma1      ma2
      0.5423  -0.6921  -0.2647
s.e.  0.0347   0.0383   0.0334

sigma^2 estimated as 4716:  log likelihood=-9282.91
AIC=18573.83   AICc=18573.85   BIC=18595.45

Training set error measures:
                     ME     RMSE      MAE       MPE     MAPE      MASE
Training set -0.7725529 62.04998 41.97265 -4.096785 20.52469 0.5027082
                     ACF1
Training set -0.003184295

Mumbai

[1] "Mumbai"
Series: AQIM.ts 
ARIMA(1,0,0)(0,1,0)[365] with drift 

Coefficients:
         ar1    drift
      0.7150  -0.0262
s.e.  0.0339   0.0121

sigma^2 estimated as 675.2:  log likelihood=-1972.82
AIC=3951.64   AICc=3951.7   BIC=3963.78

Training set error measures:
                     ME     RMSE      MAE      MPE     MAPE      MASE
Training set 0.01404403 18.98299 9.353563 -1.96953 9.687569 0.3399684
                   ACF1
Training set 0.02590205

Chennai

[1] "Chennai"
Series: AQIC.ts 
ARIMA(4,1,3) 

Coefficients:
         ar1      ar2     ar3      ar4      ma1    ma2      ma3
      1.5983  -0.8616  0.1788  -0.0069  -1.9292  1.162  -0.2259
s.e.  0.2212      NaN     NaN   0.0138   0.2212    NaN      NaN

sigma^2 estimated as 1228:  log likelihood=-9579.86
AIC=19175.72   AICc=19175.8   BIC=19220.23

Training set error measures:
                   ME    RMSE      MAE      MPE     MAPE      MASE         ACF1
Training set -1.28413 34.9666 22.85284 -7.48814 20.17663 0.4631568 -0.001062613

SARIMAX

Column

Summary

Fitted Models

Delhi: ARIMAX(1,0,0)
Mumbai: ARIMAX(2,1,2)
Chennai: ARIMAX(1,1,1)

These ARIMAX models use all the meteorological variables as regressors. They give larger MAPE than the ARIMA models without an X variable/matrix, possibly due to high deviations in the meteorological variables. A distributed lag model was also fitted as a check. In every case, ARIMA with no X variables performs best.

Delhi

Column

Mumbai

Chennai

Model Results SARIMAX


Delhi

[1] "Delhi"
Series: ydl 
Regression with ARIMA(1,0,0) errors 

Coefficients:
         ar1    temp  pressure  humidity    wind
      0.8746  1.8201    0.0421    0.2311  0.3091
s.e.  0.0191  0.5682    0.0214    0.1363  0.3253

sigma^2 estimated as 426.9:  log likelihood=-3498.19
AIC=7008.37   AICc=7008.48   BIC=7036.38

Training set error measures:
                    ME     RMSE     MAE       MPE     MAPE     MASE       ACF1
Training set 0.0194588 20.59614 13.4242 -3.261841 12.46601 1.011254 0.03280456

Mumbai

[1] "Mumbai"
Series: ydl 
Regression with ARIMA(2,1,2) errors 

Coefficients:
         ar1      ar2      ma1     ma2     temp  pressure  humidity    wind
      1.0552  -0.4096  -1.2309  0.3260  -1.2183    0.3911   -0.2039  0.0950
s.e.  0.1460   0.0964   0.1531  0.1417   1.0391    0.7103    0.1454  0.4229

sigma^2 estimated as 390.3:  log likelihood=-3456.66
AIC=6931.33   AICc=6931.56   BIC=6973.33

Training set error measures:
                      ME     RMSE      MAE       MPE     MAPE      MASE
Training set -0.07747973 19.64383 12.64135 -2.087419 11.39747 0.9522809
                     ACF1
Training set -0.001610694

Chennai

[1] "Chennai"
Series: ydl 
Regression with ARIMA(1,1,1) errors 

Coefficients:
         ar1      ma1     temp  pressure  humidity     wind
      0.6278  -0.9694  -1.2233   -0.2736    0.1754  -0.0954
s.e.  0.0248   0.0098   1.3770    0.8662    0.2518   0.5131

sigma^2 estimated as 1406:  log likelihood=-7557.43
AIC=15128.87   AICc=15128.94   BIC=15166.05

Training set error measures:
                    ME     RMSE      MAE       MPE    MAPE      MASE
Training set -1.553327 37.40857 25.00667 -8.123409 21.2297 0.9668808
                     ACF1
Training set 0.0009152456

Ljung-Box Residual Test


Delhi


    Box-Ljung test

data:  ARIMAXD.out$residuals
X-squared = 6.5218, df = 10, p-value = 0.7697

    Box-Ljung test

data:  ARIMAXD.out$residuals^2
X-squared = 49.033, df = 1, p-value = 2.517e-12

Mumbai


    Box-Ljung test

data:  ARIMAXM.out$residuals
X-squared = 15.527, df = 10, p-value = 0.114

    Box-Ljung test

data:  ARIMAXM.out$residuals^2
X-squared = 125.9, df = 1, p-value < 2.2e-16

Chennai


    Box-Ljung test

data:  ARIMAXC.out$residuals
X-squared = 14.726, df = 10, p-value = 0.1424

    Box-Ljung test

data:  ARIMAXC.out$residuals^2
X-squared = 142.62, df = 1, p-value < 2.2e-16

Comparison & Results

Column

Conclusion

Conclusion

In this project we identified the SARIMA model selected by the Hyndman & Khandakar algorithm in auto.arima as the best model for our problem. Due to limitations of computing power and the relatively low quality of the data, we were only able to forecast for Delhi and Mumbai; with distributed computing this could be overcome. The model can be automated and scaled with better-quality data to build an interactive dashboard for policy makers to support decisions and track climate change and fluctuations. Technologies like HDFS and MapReduce with R, hosted on a cloud service provider, can be used for scaling.


Thank You

Delhi Final Prediction

Column

Limitations

Limitations

The Ljung-Box test confirmed conditional heteroscedasticity. GARCH models were fitted but the standard deviations were too high; the very nature of this data is highly volatile. We have therefore obtained only a reasonable prediction. Using a Fourier series for the seasonal component, or modern techniques like LSTM neural networks, might give better results.
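One way to try the Fourier idea is auto.arima with harmonic regressors, which replaces the long seasonal difference with a smooth annual curve (a sketch on simulated daily data; K is the number of harmonic pairs and would need tuning):

```r
library(forecast)

set.seed(1)
# Two years of simulated daily data with an annual cycle
y <- ts(10 * sin(2 * pi * (1:730) / 365) + rnorm(730), frequency = 365)

# Fourier terms as external regressors; ARMA errors handle the remainder
fit <- auto.arima(y, xreg = fourier(y, K = 2), seasonal = FALSE)
fc  <- forecast(fit, xreg = fourier(y, K = 2, h = 365), h = 365)
```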

Mumbai Final Prediction

Web Scraper


# This code was used to scrape meteorological data from the web
library(rvest)    # read_html(), html_nodes(), html_text()
library(stringr)  # str_remove(), str_extract()

monthname <- c('january','february','march','april','may','june',
               'july','august','september','october','november','december')

for (year in 2015:2020) {   # all years used in the final rbind below
  data <- NULL              # reset so each data_<year> holds one year only
  for (month in 1:12) {
    # number of days in the month (the simple leap-year rule is fine for 2015-2020)
    if (monthname[month] %in% c('january','march','may','july',
                                'august','october','december')) {
      j <- 31
    } else if (month == 2) {
      j <- ifelse(year %% 4 == 0, 29, 28)
    } else {
      j <- 30
    }
    for (day in 1:j) {
      url <- read_html(paste0('https://en.tutiempo.net/records/vidp/',
                              day, '-', monthname[month], '-', year, '.html'))
      temp.raw <- url %>% html_nodes('.Temp') %>% html_text()
      temp.raw <- gsub('[^\x20-\x7E]', '', temp.raw)   # strip non-ASCII degree signs
      humid.raw <- url %>% html_nodes('.hr') %>% html_text()
      humid.raw <- str_remove(humid.raw, '%')
      wind.raw <- url %>% html_nodes('.wind') %>% html_text()
      wind.raw <- str_remove(wind.raw, ' km/h')
      wind.raw <- na.omit(str_extract(wind.raw, '[[:digit:]]+'))
      pressure.raw <- url %>% html_nodes('.prob') %>% html_text()
      pressure.raw <- str_remove(pressure.raw, 'hPa')

      # daily means of the hourly readings
      humidity <- mean(na.omit(as.numeric(humid.raw)))
      temp     <- mean(na.omit(as.numeric(temp.raw)))
      pressure <- mean(na.omit(as.numeric(pressure.raw)))
      wind     <- mean(na.omit(as.numeric(wind.raw)))
      Date <- as.Date(paste0(year, '-', month, '-', day))
      data <- rbind(data, data.frame(Date, temp, pressure, humidity, wind))
    }
  }
  assign(paste('data', year, sep = '_'), data)   # e.g. data_2015
}

data2 <- rbind(data_2015, data_2016, data_2017, data_2018, data_2019, data_2020)

write.csv(data2, 'data2.csv', row.names = FALSE)

Credits

Column

Group No

Group No

28

Member

Wellington Daniel

2016IPM118

Member

Oshin Singhal

2016IPM070

Column

Section

Section

A

Member

Shashank Giri

2016IPM094

Member

Aditya Badonia

2016IPM004

---
title: 'Air Quality Prediction'
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    theme: flatly
    social: [ "twitter", "facebook", "menu" ]
    source_code: embed
        

---

```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(htmltab)
library(tsbox)
library(stR)
library(quantmod)
library(tseries)
library(forecast)
library(dygraphs)
library(leaflet)
library(ggplot2)
library(plotly)
library(readxl)


data <- read.csv('~/R/AQI prediction/data2.csv')
data$Date<-as.Date(data$Date)
AQI <- read.csv('~/R/AQI prediction/Data/city_day.csv')


AQID<-dplyr::filter(AQI,City=="Delhi")
AQIM<-dplyr::filter(AQI,City=="Mumbai")
AQIC<-dplyr::filter(AQI,City=="Chennai")

AQID<-AQID[,c("Date","AQI")]
AQID$Date<-as.Date(AQID$Date,format="%d-%m-%Y")
AQIM<-AQIM[,c("Date","AQI")]
AQIM$Date<-as.Date(AQIM$Date,format="%d-%m-%Y")
AQIC<-AQIC[,c("Date","AQI")]
AQIC$Date<-as.Date(AQIC$Date,format="%d-%m-%Y")

AQID.ts<-ts(AQID$AQI,start=c(2015,1),frequency=365)
AQID.ts<-na.approx(AQID.ts)
AQIM.ts<-ts(AQIM$AQI,start=c(2015,1),frequency=365)
AQIM.ts<-na.approx(AQIM.ts)
AQIC.ts<-ts(AQIC$AQI,start=c(2015,1),frequency=365)
AQIC.ts<-na.approx(AQIC.ts)

```



```{r , include=FALSE}
worldPM1025 <- read_excel("aap_air_quality_database_2018_v14.xlsx",  sheet = "database", 
                          col_types = c("text","skip", "text", "skip", "skip", "text", "skip", "skip", 
                                        "text", "skip", "skip", "skip", "skip", "skip", "skip"), skip = 2)

#Source: WHO Global Ambient Air Quality Database - update 2018

#Preprocessing
names(worldPM1025)<-c("Region", "Country","PM10","PM2.5")
worldPM1025$PM10<-stringr::str_replace(worldPM1025$PM10,"-converted value","")
worldPM1025$PM10<-stringr::str_replace(worldPM1025$PM10,"[)]","")
worldPM1025$PM10<-stringr::str_replace(worldPM1025$PM10,"[(]","")
worldPM1025$PM2.5<-stringr::str_replace(worldPM1025$PM2.5,"-converted value","")
worldPM1025$PM2.5<-stringr::str_replace(worldPM1025$PM2.5,"[)]","")
worldPM1025$PM2.5<-stringr::str_replace(worldPM1025$PM2.5,"[(]","")
worldPM1025$PM2.5<-as.numeric(worldPM1025$PM2.5)
worldPM1025$PM10<-as.numeric(worldPM1025$PM10)
World2.5<-aggregate(x=worldPM1025$PM2.5,by=list(Country=worldPM1025$Country,Region=worldPM1025$Region),FUN = mean)
World10<-aggregate(x=worldPM1025$PM10,by=list(Country=worldPM1025$Country,Region=worldPM1025$Region),FUN = mean)
          

#Source:https://www.kaggle.com/paultimothymooney/latitude-and-longitude-for-every-country-and-state
countries <- read.csv("~/countries.csv")
names(countries)<-c("code","lat","lon","Country")

mergeddata2.5<-merge(World2.5,countries,type="left")
mergeddata10<-merge(World10,countries,type="left")

# Create a palette that maps factor levels to colors
pal <- colorNumeric(palette = "YlOrRd", domain = mergeddata2.5$x)
pal2 <- colorNumeric(palette = "YlOrRd", domain = mergeddata10$x)

```


```{r, include = FALSE}

o=data.frame(na.omit(AQI[,c(1,5:9)]))
fin<-aggregate(o,by=list(o$City),FUN=mean)
fin<-na.omit(fin[,-2])
names(fin)=c("city","NO","NO2","NOx","NH3","CO")


indlatlon <- read.csv("~/in.csv")

indiaf<-merge(fin,indlatlon,type="inner",by="city")

```


Overview {data-navmenu="Intro"}
===================================== 


Column
-------------------------------------
    
### Overview

The Problem


Climate change is one of the defining problems of our time, and India, the second most populous country in the world, contributes significantly to air pollution, with many of its cities among the 50 most polluted in the world. Yet the extent of pollution is not properly studied or used in decision making in India. Air quality monitoring stations exist in many cities, but in most cases their data is treated as a record of observations rather than used for prediction. Moreover, such AQI monitoring stations are absent in smaller cities. This project aims to build a forecasting model to predict AQI, focusing on Indian cities like Delhi, Mumbai and Chennai, which are infamous for their pollution levels.


### India NOx pollution

```{r}
indiafplot3 <- data.frame(lat = indiaf$lat, lon = indiaf$lng, x = indiaf$NOx)
pal5 <- colorNumeric(palette = "YlOrRd", domain = indiafplot3$x)
leaflet(indiafplot3) %>%
  addTiles() %>%
  addCircleMarkers(radius = ~x, color = ~pal5(x), stroke = FALSE, fillOpacity = 0.5)
```

Column
-------------------------------------

### Pollution level by continents and income (PM 10)

```{r}
fig <- plot_ly(World10, y = ~x, color = ~Region, type = "box")
fig
```

### Pollution levels in different countries (PM 10)

```{r}
leaflet(mergeddata10) %>%
  addTiles() %>%
  addCircleMarkers(radius = ~x/5, color = ~pal2(x), stroke = FALSE, fillOpacity = 0.5)
```

Data Summary {data-navmenu="Intro"}
=====================================

Column
-------------------------------------

### Meteorological data

Meteorological data


The meteorological data was scraped from tutiempo.net; the code for the scraper is given with the file. The values collected were temperature, pressure, humidity and wind speed.

### AQI Data

AQI Data


The AQI data was downloaded from Kaggle. The data set has AQI along with CO, NO, NOx, NH3, SO2, PM2.5 and PM10 concentrations measured in micrograms per cubic metre (µg/m³).

### World pollution Data

World pollution Data


This data was downloaded from the WHO Global Ambient Air Quality Database (2018 update). It has location-wise pollution data for countries.

Column
-------------------------------------

### India pollution Data

India pollution Data


The AQI data was manipulated and grouped to obtain mean pollution levels across various cities in India.

### Latitude Longitude countries

Latitude Longitude countries


This data was downloaded from Kaggle. Since Google's ggmap geocoding is no longer free, this was a better option than buying API access.

### Latitude Longitude Indian cities

Latitude Longitude Indian cities


This data has the locations (geocodes) of different Indian cities, which are matched with the existing pollution data for visualizations.

World Pollution Levels {data-navmenu="EDA"}
=========================================

Column
-------------------------------------

### Plot Inference

Pollution Levels around the world

The plots alongside reveal how developing countries (low-income Mediterranean, South-East Asia) contribute most to particulate pollution. The other plots show how India compares to the rest of the world.

LMIC: Low- and Middle-Income Countries

HIC: High Income Countries

### PM 2.5 across countries mapped

```{r}
leaflet(mergeddata2.5) %>%
  addTiles() %>%
  addCircleMarkers(radius = ~x/5, color = ~pal(x), stroke = FALSE, fillOpacity = 0.5)
```

### PM 10 across countries mapped

```{r}
leaflet(mergeddata10) %>%
  addTiles() %>%
  addCircleMarkers(radius = ~x/5, color = ~pal2(x), stroke = FALSE, fillOpacity = 0.5)
```

Column
-------------------------------------

### PM 10 across continents

```{r}
fig <- plot_ly(World10, y = ~x, color = ~Region, type = "box")
fig
```

### PM 2.5 across continents

```{r}
fig1 <- plot_ly(World2.5, y = ~x, color = ~Region, type = "box")
fig1
```

India Pollution Levels {data-navmenu="EDA"}
=========================================

Column
-------------------------------------

### Plot Inference

Pollution Levels in India

The plots alongside show the amounts of various pollutants and their spread across cities in India. Particulate matter is not shown here. The pollutants shown make up a significant portion of the Air Quality Index (AQI), a widely used measure that we also use in this project.


### CO

```{r}
indiafplot1 <- data.frame(lat = indiaf$lat, lon = indiaf$lng, x = indiaf$CO)
pal3 <- colorNumeric(palette = "YlOrRd", domain = indiafplot1$x)
leaflet(indiafplot1) %>%
  addTiles() %>%
  addCircleMarkers(radius = ~x*15, color = ~pal3(x), stroke = FALSE, fillOpacity = 0.5)
```

### NO2

```{r}
indiafplot2 <- data.frame(lat = indiaf$lat, lon = indiaf$lng, x = indiaf$NO2)
pal4 <- colorNumeric(palette = "YlOrRd", domain = indiafplot2$x)
leaflet(indiafplot2) %>%
  addTiles() %>%
  addCircleMarkers(radius = ~x, color = ~pal4(x), stroke = FALSE, fillOpacity = 0.5)
```

Column
-------------------------------------

### NOx

```{r}
indiafplot3 <- data.frame(lat = indiaf$lat, lon = indiaf$lng, x = indiaf$NOx)
pal5 <- colorNumeric(palette = "YlOrRd", domain = indiafplot3$x)
leaflet(indiafplot3) %>%
  addTiles() %>%
  addCircleMarkers(radius = ~x, color = ~pal5(x), stroke = FALSE, fillOpacity = 0.5)
```

### NO

```{r}
indiafplot4 <- data.frame(lat = indiaf$lat, lon = indiaf$lng, x = indiaf$NO)
pal6 <- colorNumeric(palette = "YlOrRd", domain = indiafplot4$x)
leaflet(indiafplot4) %>%
  addTiles() %>%
  addCircleMarkers(radius = ~x/2, color = ~pal6(x), stroke = FALSE, fillOpacity = 0.5)
```

### NH3

```{r}
indiafplot5 <- data.frame(lat = indiaf$lat, lon = indiaf$lng, x = indiaf$NH3)
pal7 <- colorNumeric(palette = "YlOrRd", domain = indiafplot5$x)
leaflet(indiafplot5) %>%
  addTiles() %>%
  addCircleMarkers(radius = ~x/2, color = ~pal7(x), stroke = FALSE, fillOpacity = 0.5)
```

Delhi {data-navmenu="Meteorological Trends"}
=========================================

Column
-------------------------------------

### Delhi

```{r}
m <- leaflet() %>% addTiles() %>% addMarkers(lng = 77.1025, lat = 28.7041, popup = "Delhi")
m
```

### Temperature

```{r}
datadelhi <- read.csv("datadelhi.csv")
datadelhi$Date <- as.Date(datadelhi$Date)
x <- datadelhi$Date
y <- datadelhi$temp
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("Degrees"))
fig
```

Column
-------------------------------------

### Pressure

```{r}
x <- datadelhi$Date
y <- datadelhi$pressure
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("hPa"))
fig
```

### Humidity

```{r}
# filter dates and values together so x and y stay the same length
x <- datadelhi$Date[datadelhi$humidity < 1000]
y <- datadelhi$humidity[datadelhi$humidity < 1000]
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("%"))
fig
```

### Wind

```{r}
x <- datadelhi$Date
y <- datadelhi$wind
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("Km/h"))
fig
```

Mumbai {data-navmenu="Meteorological Trends"}
=========================================

Column
-------------------------------------

### Mumbai

```{r}
m <- leaflet() %>% addTiles() %>% addMarkers(lng = 72.8777, lat = 19.0760, popup = "Mumbai")
m
```

### Temperature

```{r}
datamumbai <- read.csv("datamumbai.csv")
datamumbai$Date <- as.Date(datamumbai$Date)
x <- datamumbai$Date
y <- datamumbai$temp
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("Degrees"))
fig
```

Column
-------------------------------------

### Pressure

```{r}
x <- datamumbai$Date[datamumbai$pressure < 1100]
y <- datamumbai$pressure[datamumbai$pressure < 1100]
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("hPa"))
fig
```

### Humidity

```{r}
x <- datamumbai$Date
y <- datamumbai$humidity
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("%"))
fig
```

### Wind

```{r}
x <- datamumbai$Date
y <- datamumbai$wind
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("Km/h"))
fig
```

Chennai {data-navmenu="Meteorological Trends"}
=========================================

Column
-------------------------------------

### Chennai

```{r}
m <- leaflet() %>% addTiles() %>% addMarkers(lng = 80.2707, lat = 13.0827, popup = "Chennai")
m
```

### Temperature

```{r}
datachennai <- read.csv("datachennai.csv")
datachennai$Date <- as.Date(datachennai$Date)
x <- datachennai$Date
y <- datachennai$temp
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("Degrees"))
fig
```

Column
-------------------------------------

### Pressure

```{r}
x <- datachennai$Date
y <- datachennai$pressure
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("hPa"))
fig
```

### Humidity

```{r}
x <- datachennai$Date
y <- datachennai$humidity
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("%"))
fig
```

### Wind

```{r}
x <- datachennai$Date[datachennai$wind < 22]
y <- datachennai$wind[datachennai$wind < 22]
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("Km/h"))
fig
```

Methodology {data-navmenu="Time Series Analysis"}
=========================================

Column
-------------------------------------

### ACF/PACF

Meteorological data


By looking at the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots we can tentatively identify the number of AR and/or MA terms needed. The ACF plot is a bar chart of the correlation coefficients between a time series and lags of itself; the PACF plot shows the partial correlation coefficients between the series and its lags.

### Model Fitting

AQI Data


We have used the algorithm proposed by Hyndman & Khandakar (2008), implemented as auto.arima in R. This simplifies the process but increases time complexity; since we have a relatively small data set, we have gone ahead with it.

### SARIMA

SARIMA


SARIMA is an integrated AR/MA (ARIMA) model fitted to daily/monthly values with an additional seasonal term. We fitted the model using auto.arima.

Column
-------------------------------------

### SARIMAX

SARIMAX


The SARIMAX model additionally includes regressors, to model how the variable changes over time with covariates. This model was also fitted using auto.arima, but the results were unfavourable due to missing values and compounding of errors. A distributed lag model was also tried but gave worse results.

### Ljung-Box Test

Ljung-Box Test


The Ljung-Box test on the residuals shows whether the residuals are white noise or still contain predictability. The same test on the squared residuals checks for conditional heteroscedasticity.

### Model Selection

Model Selection


There are various error measures that can be used for model selection: AIC, BIC, MSE, MAE, MAPE, etc. We have used MAPE, and we have chosen the most robust model, as robustness is also an important criterion in selection.

ACF/PACF {data-navmenu="Time Series Analysis"}
=========================================

Importance {.tabset}
-------------------------------------

### ACF Delhi

```{r}
conf.level <- 0.95
# Reference confidence band (computed for reference, not drawn on the bar charts)
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQID.ts))
bacf <- acf(AQID.ts, plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
q <- ggplot(data = bacfdf, mapping = aes(x = lag, y = acf)) +
  geom_bar(stat = "identity", position = "identity")
q
```

### PACF Delhi

```{r}
bacf <- pacf(AQID.ts, plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
q <- ggplot(data = bacfdf, mapping = aes(x = lag, y = acf)) +
  geom_bar(stat = "identity", position = "identity")
q
```

### ACF Mumbai

```{r}
bacf <- acf(AQIM.ts, plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
q <- ggplot(data = bacfdf, mapping = aes(x = lag, y = acf)) +
  geom_bar(stat = "identity", position = "identity")
q
```

### PACF Mumbai

```{r}
bacf <- pacf(AQIM.ts, plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
q <- ggplot(data = bacfdf, mapping = aes(x = lag, y = acf)) +
  geom_bar(stat = "identity", position = "identity")
q
```

### ACF Chennai

```{r}
bacf <- acf(AQIC.ts, plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
q <- ggplot(data = bacfdf, mapping = aes(x = lag, y = acf)) +
  geom_bar(stat = "identity", position = "identity")
q
```

### PACF Chennai

```{r}
bacf <- pacf(AQIC.ts, plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
q <- ggplot(data = bacfdf, mapping = aes(x = lag, y = acf)) +
  geom_bar(stat = "identity", position = "identity")
q
```

SARIMA {data-navmenu="Time Series Analysis"}
=========================================

Column
-------------------------------------

### Summary

Fitted Models

Delhi: SARIMA(1,1,2)(0,1,0)[365]
Mumbai: SARIMA(1,0,0)(0,1,0)[365] with drift
Chennai: ARIMA(4,1,3)

The Chennai series has many missing values and hence cannot be modelled properly, which explains the apparent lack of seasonality. We drop Chennai from our analysis.

### Delhi

```{r}
set.seed(1000)
library(plotly)

# Helper used for every city: draws the fitted series, the 80%/95%
# confidence bands and the point forecast of a `forecast` object.
plot_forecast <- function(fc, ytitle = "AQI") {
  t_band <- c(time(fc$upper), rev(time(fc$upper))[1], rev(time(fc$upper)))
  band_y <- function(level) {
    c(as.numeric(fc$lower[, level]),
      rev(as.numeric(fc$upper[, level]))[1],
      rev(as.numeric(fc$upper[, level])))
  }
  p <- plot_ly()
  p <- add_trace(p, x = time(fc$fitted), y = as.numeric(fc$fitted),
                 mode = "lines", type = "scatter", name = "observed",
                 line = list(color = "rgba(0,0,0,1)"))
  p <- add_trace(p, x = t_band, y = band_y(2), fill = "toself",
                 mode = "lines", type = "scatter", name = "95% confidence",
                 hoveron = "points", line = list(color = "rgba(242,242,242,1)"))
  p <- add_trace(p, x = t_band, y = band_y(1), fill = "toself",
                 mode = "lines", type = "scatter", name = "80% confidence",
                 hoveron = "points", line = list(color = "rgba(204,204,204,1)"))
  p <- add_trace(p, x = time(fc$mean), y = as.numeric(fc$mean),
                 mode = "lines", type = "scatter", name = "prediction",
                 line = list(color = "rgba(0,0,255,1)"))
  layout(p, title = "Forecast from Auto.Arima",
         xaxis = list(title = "Year", domain = c(0, 1)),
         yaxis = list(title = ytitle, domain = c(0, 1)),
         margin = list(b = 40, l = 60, r = 10, t = 25))
}

ARIMAXD.out <- auto.arima(y = AQID.ts)
forcastD <- forecast(object = ARIMAXD.out, h = 365)
Time0 <- time(forcastD$mean)
AQIDelhi0 <- as.numeric(forcastD$mean)
figD <- plot_ly(x = ~Time0, y = ~AQIDelhi0, mode = "lines", type = "scatter")
plot_forecast(forcastD)
```

Column
-------------------------------------

### Mumbai

```{r}
set.seed(1000)
ARIMAXM.out <- auto.arima(y = AQIM.ts)
forcastM <- forecast(object = ARIMAXM.out, h = 365)
Time1 <- time(forcastM$mean)
AQIMumbai0 <- as.numeric(forcastM$mean)
figM <- plot_ly(x = ~Time1, y = ~AQIMumbai0, mode = "lines", type = "scatter")
plot_forecast(forcastM)
```

### Chennai

```{r}
set.seed(1000)
ARIMAXC.out <- auto.arima(y = AQIC.ts)
forcastC <- forecast(object = ARIMAXC.out, h = 365)
x1 <- seq(from = as.Date("2020-01-08"), to = as.Date("2021-01-08"), length.out = 365)
y1 <- as.numeric(forcastC$mean)
fig <- plot_ly(x = ~x1, y = ~y1, mode = "lines", type = "scatter")
plot_forecast(forcastC)
```

Model Results SARIMA { data-navmenu="Time Series Analysis"}
===================================================================

### Importance {.tabset}
-------------------------------------

### Delhi

```{r}
print("Delhi")
summary(ARIMAXD.out)
```

### Mumbai

```{r}
print("Mumbai")
summary(ARIMAXM.out)
```

### Chennai

```{r}
print("Chennai")
summary(ARIMAXC.out)
```

SARIMAX {data-navmenu="Time Series Analysis"}
=========================================

Column
-------------------------------------

### Summary

Fitted Models

- Delhi: SARIMAX(1,0,0)
- Mumbai: SARIMA(2,1,2)
- Chennai: SARIMA(1,1,1)

These ARIMAX models use all the meteorological variables as external regressors. They give a larger MAPE than the ARIMA model without an X variable/matrix, possibly because of the high variance in the meteorological series. A distributed-lag model was also fitted as a check; in every case, the ARIMA model without exogenous variables remained the best.
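The with/without-X comparison can be reproduced on a holdout set. The sketch below uses base R's stats::arima (which also accepts an xreg matrix) on simulated data; the series and regressors here are synthetic stand-ins for the AQI and meteorological files, and the AR(1) order is an illustrative choice, not the fitted model.

```{r}
set.seed(1000)
n_obs <- 600
# Stand-in meteorological regressors and an AQI-like series around 100
met_sim <- cbind(temp = rnorm(n_obs), wind = rnorm(n_obs))
aq_sim <- 100 + as.numeric(arima.sim(list(ar = 0.6), n_obs, sd = 5)) +
  2 * met_sim[, "temp"]
train <- 1:500
test <- 501:600

# Same order with and without the X matrix
fit_plain <- arima(aq_sim[train], order = c(1, 0, 0))
fit_xreg  <- arima(aq_sim[train], order = c(1, 0, 0), xreg = met_sim[train, ])

mape <- function(actual, pred) 100 * mean(abs((actual - pred) / actual))
pred_plain <- predict(fit_plain, n.ahead = 100)$pred
pred_xreg  <- predict(fit_xreg, n.ahead = 100, newxreg = met_sim[test, ])$pred
c(ARIMA = mape(aq_sim[test], pred_plain), ARIMAX = mape(aq_sim[test], pred_xreg))
```

The same accuracy comparison applies unchanged when the real AQI series and the scraped meteorological matrix are substituted for the simulated objects.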

### Delhi

```{r}
# Forecast-plotting helper (same as on the SARIMA page; defined here as
# well so this page runs on its own).
plot_forecast <- function(fc, ytitle = "AQI") {
  t_band <- c(time(fc$upper), rev(time(fc$upper))[1], rev(time(fc$upper)))
  band_y <- function(level) {
    c(as.numeric(fc$lower[, level]),
      rev(as.numeric(fc$upper[, level]))[1],
      rev(as.numeric(fc$upper[, level])))
  }
  p <- plot_ly()
  p <- add_trace(p, x = time(fc$fitted), y = as.numeric(fc$fitted),
                 mode = "lines", type = "scatter", name = "observed",
                 line = list(color = "rgba(0,0,0,1)"))
  p <- add_trace(p, x = t_band, y = band_y(2), fill = "toself",
                 mode = "lines", type = "scatter", name = "95% confidence",
                 hoveron = "points", line = list(color = "rgba(242,242,242,1)"))
  p <- add_trace(p, x = t_band, y = band_y(1), fill = "toself",
                 mode = "lines", type = "scatter", name = "80% confidence",
                 hoveron = "points", line = list(color = "rgba(204,204,204,1)"))
  p <- add_trace(p, x = time(fc$mean), y = as.numeric(fc$mean),
                 mode = "lines", type = "scatter", name = "prediction",
                 line = list(color = "rgba(0,0,255,1)"))
  layout(p, title = "Forecast from Auto.Arima",
         xaxis = list(title = "Year", domain = c(0, 1)),
         yaxis = list(title = ytitle, domain = c(0, 1)),
         margin = list(b = 40, l = 60, r = 10, t = 25))
}

datadelhi <- read.csv("datadelhi.csv")
datadelhi$Date <- as.Date(datadelhi$Date)
ydl <- as.ts(AQID.ts[1:1500])  # fixed: the Delhi chunk previously used AQIM.ts
xdl <- as.matrix(datadelhi[1:1500, -1])
ARIMAXD2.out <- auto.arima(y = ydl, xreg = xdl)
forcastdlD <- forecast(object = ARIMAXD2.out, h = 365,
                       xreg = as.matrix(datadelhi[1501:1865, -1]))
plot_forecast(forcastdlD)
```

Column
-------------------------------------

### Mumbai

```{r}
datamumbai <- read.csv("datamumbai.csv")
datamumbai$Date <- as.Date(datamumbai$Date)
ydl <- as.ts(AQIM.ts[1:1500])
xdl <- as.matrix(datamumbai[1:1500, -1])
ARIMAXM2.out <- auto.arima(y = ydl, xreg = xdl)
forcastdlM <- forecast(object = ARIMAXM2.out, h = 365,
                       xreg = as.matrix(datamumbai[1501:1865, -1]))
plot_forecast(forcastdlM)
```

### Chennai

```{r}
datachennai <- read.csv("datachennai.csv")
datachennai$Date <- as.Date(datachennai$Date)
ydl <- as.ts(AQIC.ts[1:1500])
xdl <- as.matrix(datachennai[1:1500, -1])
ARIMAXC2.out <- auto.arima(y = ydl, xreg = xdl)
forcastdlC <- forecast(object = ARIMAXC2.out, h = 365,
                       xreg = as.matrix(datachennai[1501:1865, -1]))
plot_forecast(forcastdlC)
```

Model Results SARIMAX { data-navmenu="Time Series Analysis"}
===================================================================

### Importance {.tabset}
-------------------------------------

### Delhi

```{r}
print("Delhi")
summary(ARIMAXD2.out)
```

### Mumbai

```{r}
print("Mumbai")
summary(ARIMAXM2.out)
```

### Chennai

```{r}
print("Chennai")
summary(ARIMAXC2.out)
```

Ljung-Box Residual Test { data-navmenu="Time Series Analysis"}
===================================================================

### Importance {.tabset}
-------------------------------------

### Delhi

```{r}
Box.test(ARIMAXD.out$residuals, lag = 10, type = "Ljung")
Box.test(x = ARIMAXD.out$residuals^2, lag = 1, type = "Ljung")
```

### Mumbai

```{r}
Box.test(ARIMAXM.out$residuals, lag = 10, type = "Ljung")
Box.test(x = ARIMAXM.out$residuals^2, lag = 1, type = "Ljung")
```

### Chennai

```{r}
Box.test(ARIMAXC.out$residuals, lag = 10, type = "Ljung")
Box.test(x = ARIMAXC.out$residuals^2, lag = 1, type = "Ljung")
```

Comparison & Results {data-navmenu="Time Series Analysis"}
===================================================================

Column
-------------------------------------

### Conclusion

Conclusion

In this project we identified the seasonal ARIMA model selected by the Hyndman-Khandakar algorithm in auto.arima as the best model for our problem. Owing to limited computing power and the relatively low quality of the data, we were only able to forecast for Delhi and Mumbai; distributed computing can overcome this. With better-quality data, the model can be automated and scaled into an interactive dashboard that helps policy makers make decisions and track climate change and its fluctuations. Technologies such as HDFS and MapReduce with R, run on a cloud service provider, can be used for scaling.


Thank You

### Delhi Final Prediction

```{r}
figD
```

Column
-------------------------------------

### Limitations

Limitations

The Ljung-Box test on the squared residuals confirmed conditional heteroscedasticity. GARCH models were fitted, but their standard deviations were too high; the data are inherently highly volatile, so the predictions obtained are reasonable rather than precise. Using Fourier terms for the seasonal component, or modern techniques such as LSTM neural networks, might give better results.
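The Fourier-term idea can be sketched in base R by supplying sine/cosine harmonics as the xreg matrix of stats::arima. Everything below is simulated and illustrative: the 365-day period, two harmonics, and the AR(1) order are assumptions, not the project's fitted model.

```{r}
set.seed(1000)
period <- 365
t_idx <- 1:730

# K harmonics of the annual cycle as external regressors
fourier_terms <- function(t, period, K = 2) {
  do.call(cbind, lapply(1:K, function(k)
    cbind(sin(2 * pi * k * t / period), cos(2 * pi * k * t / period))))
}

X_train <- fourier_terms(t_idx, period)
# Synthetic AQI-like series with a yearly seasonal swing plus AR(1) noise
aq_seasonal <- 100 + 30 * sin(2 * pi * t_idx / period) +
  as.numeric(arima.sim(list(ar = 0.5), length(t_idx), sd = 5))

fit_fourier <- arima(aq_seasonal, order = c(1, 0, 0), xreg = X_train)

# Forecast one year ahead by extending the harmonics into the future
X_future <- fourier_terms(731:1095, period)
fc_fourier <- predict(fit_fourier, n.ahead = 365, newxreg = X_future)$pred
range(fc_fourier)
```

Because the harmonics are deterministic functions of time, the future xreg matrix is known exactly, which avoids the forecasting problem that the meteorological regressors have.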

### Mumbai Final Prediction

```{r}
figM
```

Web Scraper
=========================================

### Importance {.tabset}
-------------------------------------

### Page 1

```{r, eval=FALSE, echo=TRUE}
# This code was used to scrape meteorological data from the web
monthname <- c('january','february','march','april','may','june',
               'july','august','september','october','november','december')
for (year in 2015:2020) {  # fixed: loop now covers all years combined below
  data <- NULL             # reset so data_<year> holds only that year
  for (month in 1:12) {
    if (monthname[month] %in% c('january','march','may','july',
                                'august','october','december')) {
      j <- 31
    } else {
      j <- 30
    }
    if (month == 2 & year %% 4 == 0) {
      j <- 29
    } else if (month == 2 & year %% 4 != 0) {
      j <- 28
    }
    for (day in 1:j) {
      url <- read_html(paste0('https://en.tutiempo.net/records/vidp/',
                              day, '-', monthname[month], '-', year, '.html'))
      temp.raw <- url %>% html_nodes('.Temp') %>% html_text()
      temp.raw <- gsub('[^\x20-\x7E]', '', temp.raw)
      humid.raw <- url %>% html_nodes('.hr') %>% html_text()
      humid.raw <- str_remove(humid.raw, '%')
      wind.raw <- url %>% html_nodes('.wind') %>% html_text()
```

### Page 2

```{r, eval=FALSE, echo=TRUE}
      wind.raw <- str_remove(wind.raw, ' km/h')
      wind.raw <- na.omit(str_extract(wind.raw, '[[:digit:]]+'))
      pressure.raw <- url %>% html_nodes('.prob') %>% html_text()
      pressure.raw <- str_remove(pressure.raw, 'hPa')
      humidity <- mean(na.omit(as.integer(humid.raw)))
      temp <- mean(na.omit(as.integer(temp.raw)))
      pressure <- mean(na.omit(as.integer(pressure.raw)))
      wind <- mean(na.omit(as.integer(wind.raw)))
      Date <- as.Date(paste0(year, '-', month, '-', day))
      datatemp <- data.frame(Date, temp, pressure, humidity, wind)
      data <- rbind(data, datatemp)
    }
  }
  nametemp <- paste('data', year, sep = '_')
  assign(nametemp, data)
}
data2 <- rbind(data_2015, data_2016, data_2017, data_2018, data_2019, data_2020)
write.csv(data2, 'data2.csv', row.names = FALSE)
```

Credits
=========================================

Column
-------------------------------------

### Group No

Group No

28

### Member

Wellington Daniel

2016IPM118

### Member

Oshin Singhal

2016IPM070

Column
-------------------------------------

### Section

Section

A

### Member

Shashank Giri

2016IPM094

### Member

Aditya Badonia

2016IPM004